Accurate prediction of RNA secondary structure including pseudoknots through solving minimum-cost flow with learned potentials

Pseudoknots are key structure motifs of RNA and pseudoknotted RNAs play important roles in a variety of biological processes. Here, we present KnotFold, an accurate approach to the prediction of RNA secondary structure including pseudoknots. The key elements of KnotFold include a learned potential function and a minimum-cost flow algorithm to find the secondary structure with the lowest potential. KnotFold learns the potential from the RNAs with known structures using an attention-based neural network, thus avoiding the inaccuracy of hand-crafted energy functions. The specially designed minimum-cost flow algorithm used by KnotFold considers all possible combinations of base pairs and selects from them the optimal combination. The algorithm breaks the restriction of nested base pairs required by the widely used dynamic programming algorithms, thus enabling the identification of pseudoknots. Using 1,009 pseudoknotted RNAs as representatives, we demonstrate the successful application of KnotFold in predicting RNA secondary structures including pseudoknots with accuracy higher than the state-of-the-art approaches. We anticipate that KnotFold, with its superior accuracy, will greatly facilitate the understanding of RNA structures and functionalities.

1 Supplementary results and discussions

Experimental settings for secondary structure prediction methods
To mitigate biases introduced by training data and ensure a fair comparison, we trained models of KnotFold, MXfold2, and UFold on the constructed training set with Intel CPU version Xeon E5-2690 v4 (2.60G Hz, 64GB memory) and GPU version Tesla V100 PCIe (16GB memory).We employed the default hyperparameters to train MXfold2 for 10 epochs and UFold for 100 epochs.We evaluated the performance of KnotFold and eight other approaches over PKTest, including SPOT-RNA, MXfold2 (the released and the retrained versions), UFold (the released and the retrained versions), RNAstructure, ProbKnot, IPknot, Knotty, and pKiss.For pKiss, we utilized the enforced mode instead of the mfe mode for generating structures containing pseudoknots.The other approaches were run with default experimental settings.
We also evaluated the performance of BiokoP [1], which utilizes a biobjective integer programming algorithm to optimize both energy and probabilistic criteria in RNA secondary structure prediction.Regrettably, BiokoP could only predict the secondary structure of a small subset of the RNAs (86 out of 1305 on TS0, for example), accounting for less than 10% of the dataset, and the average accuracy of BiokoP over reported structures reached a middle level (0.535) among these RNA structure prediction approaches.As a result, we excluded the performance of BiokoP from our analysis.

Extending KnotFold to identify base triples
Besides base pairs, an RNA might also form base triples [2], which involve three bases interacting edge-to-edge by hydrogen bonding.Supplementary Figure 3 shows the secondary structure of the RNA with PDB entry 5L4O, which contains three base triples, 25C-10G-45G (in magenta), 13C-22G-46A (in green), and 37A-29G-41C (in blue).Previous studies have reported the importance of base triples in RNA structures and functions [3,4].
The conventional dynamic programming algorithms, however, do not allow for base triples when constructing secondary structures.In contrast, Knot-Fold is capable to predict secondary structures including base triples with a slight modification without any change of its essence.In particular, we enhance KnotFold by changing the edge capacity from 1 to 2, which allows a base to interact with 2 other bases, thus forming base triples.
As shown in Supplementary Figure 3, the original KnotFold missed the base pair 25C-10G (shown as magenta dashed line), and thus failed to identify the base triple 25C-10G-45G.Similarly, the absence of base pairs 22G-46A and 29G-41C leads to the failure in identifying the other two base triples 13C-22G-46A and 37A-29G-41C.In contrast, the enhanced KnotFold successfully identified all three base triples, and thus correctly predicted the secondary structure for this RNA.A similar result is illustrated for another RNA 7LYJ in Supplementary Figure 4.
To benchmark KnotFold's performance for base triple prediction, we utilize datasets collected from PDB by SPOT-RNA.The datasets include a training set TR1, a validation set VL1, and two test sets TS1 and TS2, which contain 104, 27, 53, and 29 RNAs with base triples, respectively (see Supplementary Table 12 for details).The training set TR1 has limited number of sequences, and it alone is insufficient for training.To overcome this difficulty, we employed the transfer learning strategy as performed by SPOT-RNA.In particular, we first trained KnotFold using bpRNA and Rfam 14.5 dataset, and then fine-tuned the model using TR1.
We assessed KnotFold and SPOT-RNA on the test sets TS1 and TS2.As shown in Supplementary Table 13, KnotFold achieves an accuracy of 0.739 on TS1, higher than SPOT-RNA by 0.025.The predicted accuracy of KnotFold for base triples reaches 0.239, higher than SPOT-RNA by 0.047.The advantage of KnotFold over SPOT-RNA is much clearer on the TS2 dataset.These results clearly illustrate KnotFold's capacity in predicting RNA base triples.

Analysis on ensemble strategy
In the paper, an ensemble of five randomly initialized models is used to increase robustness against overfitting in the calculation of base-pairing probabilities.Each model has the same network structure, but differs only in the initial values of the parameters.It is crucial to understand the accuracy of each model and how the ensemble strategy affects the predictions.
To investigate the ensemble strategy, we carried out two experiments, one using five models with the same network architecture but different initial parameters, and the other using five models with different network architectures, i.e., the number of Transformer layers, and the embedding dimension.The network architectures as well as the experimental results are summarized in Supplementary Table 5 and 6.From these tables, we achieved the following observations: • In the case that five models share an identical network architecture but different initial parameters, the individual models exhibit similar performance.For example, on the validation set containing 1,131 RNAs, the F1 scores of the five models are 0.715, 0.711, 0.712, 0.717, and 0.712, respectively.The ensemble of these five models exhibits an F1 score of 0.724, slightly higher than the best-performing individual model by 0.007.• In the case that five models have different network architectures, their performance showcases relatively large variation.For example, on the validation set, the five models achieve an F1 score of 0.721, 0.714, 0.717, 0.720, and 0.709, respectively.The ensemble of these five models exhibit an additional improvement in accuracy, achieving an F1 score of 0.732, even higher than the ensemble with the same network structure by 0.008.
We further repeated these experiments on a sequence-level test set PKTest (containing 1,009 pseudoknotted RNAs) and a family-level test set RfamNew (containing 472 RNAs extracted from the 168 RNA families newly added to Rfam 14.9) and observed similar phenomena (see Supplementary Table 5 and 6 for further details).
Together, these results suggest that, compared with the models sharing identical network architecture, the models using different architectures can yield better ensemble models.These observations are consistent with the view point by T G Dietterich [5], i.e., different network architectures introduce more varieties to the individual models, which will facilitate the ensemble strategy.

Assessing the prediction accuracy of base pair probability
We evaluate the accuracy of the base pair probability directly generated by the neural networks.For the sake of fair comparison, we used the same evaluation metric as SPOT-RNA, i.e., treating the task as a binary classification problem (forming base pairs or not) on the upper triangle probability matrix.As illustrated by Supplementary Table 7, the F1 score of the base pair probability generated by KnotFold's prior model reaches 0.729 on the PKTest dataset, higher than SPOT-RNA by 0.161.Similar results can be found when comparing KnotFold with UFold as well as its retrained version.The results highlight the advantage of KnotFold in predicting base pairing probabilities.We also investigated the power of self-attention network used by KnotFold in predicting pseudoknotted base pairs on PKTest.As shown in Supplementary Table 7, KnotFold achieves an F1 score of 0.697, 0.771, and 0.687 for pseudoknot-free, crossing-pseudoknot, and pseudoknotted base pairs, respectively.In contrast, SPOT-RNAachieved the F1 score of 0.552, 0.596, and 0.226, respectively.These results highlight the advantages of its self-attention network architecture used by KnotFold in predicting pseudoknotted base pairs.Furthermore, we evaluated whether KnotFold's self-attention networks have advantage in the identification of long-distance base pairs.Here, similarly to Ref. [6], we defined "long-distance base pairs" as these pairs at least 101 bases apart in sequence, and we adopted the average accuracy for long-distance base pairs as our evaluation criteria.
As shown in Supplementary Table 7, on PKTest, KnotFold achieves the accuracy of long-range interactions of 0.535, significantly higher than SPOT-RNA (0.285).We observed similar phenomena from the comparison on the validation set.These results clearly demonstrate KnotFold's capacity in predicting long-distance base pairs.

Removing the tentatively-selected base pairs during the execution of KnotFold
To identify the base pairs with the lowest potential, the minimum-cost flow algorithm used by KnotFold sometimes removes certain base pairs that have tentatively been selected.KnotFold accomplishes this objective using "backward flows", a unique feature of the network flow algorithm.Supplementary Figure 5 shows one iteration step of KnotFold when predicting secondary structure for BA000035.2_1474355-1474472.After selecting 22 base pairs, the minimum-cost flow algorithm reports a tentative secondary structure with a cost of -379.829.At this stage, the minimum-cost flow algorithm found an improvement path that contains two backward edges 15G-40C and 23G-36C (shown as blue lines).After removing the base pairs corresponding to these two edges, KnotFold obtained an improved secondary structure with a lower potential of -379.831.This removing operation is key to the success of KnotFold in acquiring the optimal secondary structure.

Optimal setting of the parameter λ in potential function
As shown in Equation 1, the parameter λ is set to control the number of base pairs expected to appear in the predicted RNA secondary structure.We set the optimal λ as 4.2 such that the average number of base pairs expected to appear in the predicted RNA secondary structure is close to that in the ground-truth RNA secondary structure in the training set.The setting of λ also achieves a trade-off between precision and recall of predictions through setting an appropriate λ in the potential function used by KnotFold.As demonstrated in Supplementary Fig. 6a, when set λ as 4.2, the precision and recall over valid set also achieve a balance.Since the setting of λ is highly related to the number of base pairs, we can also modify the structural potential function (shown in Equation 1) and regard λ as a function of the number of base pairs.The high correlation between the length of the RNA sequence and the number of expected base pairs enables us to appropriately set λ prior to running KnotFold according to RNA length.More specifically, the optimal λ exhibits a highly negative linear correlation with RNA length (Pearson correlation coefficient: 0.988, shown in Supplementary Fig. 6b).
We have implemented an adaptive version of the minimum-cost flow algorithm to derive the λ value according to the length of the RNA.Specifically, according to the observation of the negative linear correlation between the optimal λ value and RNA length in Supplementary Fig. 5b, we implemented a new version of the minimum-cost flow algorithm and set λ to 4.80, 4.68, 4.60, 4.15, 3.86, 2.05, and -0.70 for RNAs with lengths falling within the ranges of 1-100, 101-200, 201-300, 301-400, 401-500, 501-1000, and longer than 1000, respectively.The results of this algorithm are very promising: its accuracy is 0.763 on PKTest dataset, higher than the original version (0.758).
As the RNAs in PKTest dataset are relatively short with length less than 500 bases, we further examined the improvement of this algorithm on long RNAs.For the 49 RNAs in the validation set with length exceeding 500 bases, the accuracy increased from 0.119 to 0.207.These results demonstrate the potential of the adaptive λ approach to improve the accuracy of secondary structure predictions, particularly for long RNA sequences.

Investigate KnotFold's robustness using TestSetA and TestSetB
To investigate the robustness of our models, we validate them on datasets that have been widely used in previous studies.We conduct experiments following the settings employed in TORNADO [7].Specifically, we retrained KnotFold using TrainSetA (containing 3,166 RNAs), and tested it on TestSetA (containing 592 RNAs) and TestSetB (containing 430 RNAs).We summarized the evaluation results in Supplementary Table 9.For the sake of completeness, we also listed the performance of other five approaches on these datasets, including TORNADO, ContextFold, CONTRAfold, MXfold, and MXfold2.As a result, the retrained KnotFold achieves an F1 score of 0.728 on TestSetA and 0.551 on TestSetB, which are comparable with the performance of TORNADO (0.746 on TestSetA, 0.552 on TestSetB) and other four approaches.These findings illustrate the robustness and effectiveness of our models in accurately predicting secondary structures on widely-used datasets, even in the absence of pseudoknots.

Evaluate the performance of secondary structure prediction at lower sequence identity
Sequence identity is critical for judging whether there is substantial overlap between training set and test set.Here, we conducted a series of experiments using both CD-HIT [8] and MMseqs2 [9] to assess the KnotFold's performance at different sequence identity cut-off (0.80 and 0.40), and we list the results in Supplementary Table 10.
In the manuscript, we applied the sequence identity cut-off of 0.80 as the software tool used to construct training, validation, and test sets, CD-HIT, supports the sequence identify cut-off within the range [0.80, 1.00] only.Under this setting, we obtained 25,959 clusters from the non-redundant RNA sequences collected in bpRNA-1m and Rfam 14.5 databases, and further randomly divided these sequences into training set (23,819 sequences), validation set (1,131 sequences), and test set (1,009 sequences, denoted as PKTest).
To assess the performance of KnotFold and other approaches under the sequence identity cut-off of 0.40, we used MMseqs2, a widely-used tool in clustering protein sequences for efficient searching, instead of CD-HIT to cluster sequences and then construct data sets.We assessed the approaches using MMseqs2 as sequence clustering tool under various sequence identity cut-off, and summarized the results below: • Sequence identity cut-off of 0.80: Under this setting, we acquired 27,799 clusters, about 7% more than that reported by CD-HIT.Next, we removed the 1,123 clusters that overlap with PKTest, and randomly divided the left-over into training set (25,676 sequences) and validation set (1,000 sequences).
We retrained KnotFold using the new training set and observed that, the acquired KnotFold model achieves an F1 score of 0.761, which is very close to the performance acquired using CD-HIT as clustering tool with the same sequence identity of 0.80.• Sequence identity cut-off of 0.40: Under this setting, MMseqs2 produced 15,439 clusters.After removing the 793 clusters that overlap with the PKTest dataset, we randomly divided the left-over into a training set (13,646 sequences) and a validation set (1,000 sequences).We again retrained KnotFold using this training set and observed that the acquired KnotFold model achieved an F1 score of 0.512 on PKTest, lower than that acquired using the cut-off 0.80 by 0.249.We also assessed MXfold2 under this setting.The retrained MXfold2 model acquired using the same procedure achieved an F1 score of 0.475 on PKTest, also lower than that acquired using the cut-off of 0.80.We conjectured that the underlying reason might be the decrease of the training set size with the sequence identity cut-off.

Enhancing the predictions of SPOT-RNA using the minimum-cost flow algorithm
We investigated whether the minimum-cost flow algorithm could further enhance the predictions of the base pair probability generated by other approaches.Here, we built a variant of SPOT-RNA called "SPOT-RNA enhanced with minimum-cost flow", which predicts secondary structure by applying the minimum-cost flow on thebase pair probabilities reported by SPOT-RNA.Specifically, we constructed a structural potential (as shown in Equation 1) utilizing the output probabilities of SPOT-RNA.By employing these probabilities as priors and incorporating a constant reference, we apply the same minimum-cost flow algorithm to construct the secondary structure with the lowest potential.
As shown in Supplementary Table 8, the "SPOT-RNA enhanced with minimum-cost flow" variant reaches an accuracy of 0.590, showing an improvement of 0.011 compared to the performance of the original version of SPOT-RNA.These results demonstrate that the minimum-cost flow algorithm can be used to enhance SPOT-RNA to predict secondary structure.

Reformulating the potential function as the total cost of a flow
The RNA-specific structural potential we construct (Equation 1) can be reformulated as follows:  1−P (bpi,j |length) , which is a constant independent of structure S. Interestingly, we observe that although the potential is calculated by adding up the contributions of all possible pairs of bases, it is essentially determined by the base pairs appearing in the secondary structure only, as the first term illustrated.This observation is consistent with conventional thermodynamic prediction methods that only consider the contribution of base pairs when calculating free energy [10][11][12].
By leveraging this observation, we transform the goal of finding the secondary structure with the lowest potential into calculating the minimum-cost flow in an appropriately-designed network.In particular, the network consists of a bipartite, and each part of the bipartite consists of L nodes that represent the L bases of the target RNA.We connected every node in the left part to every node in the right part with an edge, which essentially represents a possible base pair.For the edge (i, j) connecting the i-th base and the j-th base, we set its capacity as 1 and its cost as H(i, j).This way, the potential exactly describes the total cost of a flow in the constructed graph network, and the secondary structure with the lowest potential corresponds to the minimum-cost flow.

The construction of the variant KnotFold-DP
We further developed a variant of KnotFold (called KnotFold-DP), which is particularly designed for non-pseudoknot RNAs.KnotFold-DP employs the dynamic programming algorithm instead of the minimum-cost flow algorithm to determine the optimal secondary structure.For the potential function we previously constructed (Equation 1), the minimum-cost flow algorithm constructs an arbitrary secondary structure with the lowest potential, as is demonstrated above.However, when restricting RNAs to nested base pairs only, i.e. non-pseudoknot RNAs, a dynamic programming algorithm could be applied to construct the lowest potential structure.
The dynamic programming algorithm used by KnotFold-DP solves the following recursive equation: given the calculated pairwise potentials H(i, j), the structure potential from the i − th base to the j − th base E i,j is calculated as below: This simple DP strategy constructs a secondary structure with the lowest potential for non-pseudoknot RNAs.

The construction of the variant KnotFold-LP
We have performed an additional variant of KnotFold (called KnotFold-LP), which employs linear programming technique used in UFold [13] instead of minimum-cost flow to construct secondary structures.Using the same base pair probability as KnotFold, the linear programming algorithm constructs secondary structure from base pair probabilities with several strict constraints, and it determines the final base pairs by using an iterative grid search strategy.
As listed in Supplementary Table 8, KnotFold combinded with linear programming technique achieves an F1 score of 0.705 on PKTest, lower than the performance of the combination with the minimum-cost flow technique by 0.053.This comparison clearly illustrates the advantages of the minimum-cost flow algorithm over the linear programming approach used by UFold.

Supplementary Fig. 1 Supplementary Fig. 9
The predicted secondary structures by KnotFold on two RNAs with K-and L-type pseudoknots.a, The predicted structure for the RNA URS0000D69A17_12908_1-124 with the K-type pseudoknot.b, The predicted structure for the RNA URS0000D67D31_12908_1-61 with L-type pseudoknots.In both cases, KnotFold can successfully identify these pseudoknots.Here, magenta dash lines represent the missing base pairs and blue lines represent the false-positive base pairs The running time of the minimum-cost flow algorithm for RNAs with different length in the validation set.The running time is evaluated on 1,131 RNAs in the validation set.The minimum-cost flow algorithm constructs the secondary structures within 20 seconds for RNAs no longer than 1,000 bases, and within 50 seconds for RNAs no longer than 2,000 bases [12] Mathews, D.H., Andre, T.C., Kim, J., Turner, D.H., Zuker, M.: An Updated Recursive Algorithm for RNA Secondary Structure Prediction with Improved Thermodynamic Parameters.ACS Publications (1998) [13] Fu, L., Cao, Y., Wu, J., Peng, Q., Nie, Q., Xie, X.: UFold: Fast and Accurate RNA Secondary Structure Prediction with Deep Learning.Nucleic Acids Research 50(3), 14-14 (2022)